AITopics

Country:

Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
North America > United States > New York (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(4 more...)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.68)

Neural Information Processing SystemsDec-24-2025, 02:43:26 GMT

Characterizing Optimal Mixed Policies: Where to Intervene and What to Observe

Intelligent agents are continuously faced with the challenge of optimizing a policy based on what they can observe (see) and which actions they can take (do) in the environment where they are deployed. Most policy can be parametrized in terms of these two dimensions, i.e., as a function of what can be seen and done given a certain situation, which we call a \textit{mixed policy}. In this paper, we investigate several properties of the class of mixed policies and provide an efficient and effective characterization, including optimality and non-redundancy. Specifically, we introduce a graphical criterion to identify unnecessary contexts for a set of actions, leading to a natural characterization of non-redundancy of mixed policies. We then derive sufficient conditions under which one strategy can dominate the other with respect to their maximum achievable expected rewards (optimality). This characterization leads to a fundamental understanding of the space of mixed policies and a possible refinement of the agent's strategy so that it converges to the optimum faster and more robustly. One surprising result of the causal characterization is that the agent following a more standard approach --- intervening on all intervenable variables and observing all available contexts --- may be hurting itself, and will never achieve an optimal performance.

characterizing optimal, name change, policy, (5 more...)

Technology: Information Technology > Artificial Intelligence (0.63)

Neural Information Processing SystemsOct-3-2025, 01:36:52 GMT

Characterizing Optimal Mixed Policies: Where to Intervene and What to Observe

Most policies can be parametrized in terms of these two dimensions, i.e., as a function of what can be seen and done

artificial intelligence, machine learning, mixed policy, (16 more...)

Country:

North America > United States (0.46)
North America > Canada (0.28)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Data Science (0.68)

Neural Information Processing SystemsOct-10-2024, 08:50:55 GMT

Characterizing Optimal Mixed Policies: Where to Intervene and What to Observe

characterization, characterizing optimal, mixed policy, (1 more...)

Technology: Information Technology > Artificial Intelligence (0.67)

Nishimura, Naoki, Kobayashi, Ken, Nakata, Kazuhide

Balancing Immediate Revenue and Future Off-Policy Evaluation in Coupon Allocation

arXiv.org Artificial IntelligenceJul-17-2024

Coupon allocation drives customer purchases and boosts revenue. However, it presents a fundamental trade-off between exploiting the current optimal policy to maximize immediate revenue and exploring alternative policies to collect data for future policy improvement via off-policy evaluation (OPE). While online A/B testing can validate new policies, it risks compromising short-term revenue. Conversely, relying solely on an exploitative policy hinders the ability to reliably estimate and enhance future policies. To balance this trade-off, we propose a novel approach that combines a model-based revenue maximization policy and a randomized exploration policy for data collection. Our framework enables flexibly adjusting the mixture ratio between these two policies to optimize the balance between short-term revenue and future policy improvement. We formulate the problem of determining the optimal mixture ratio between a model-based revenue maximization policy and a randomized exploration policy for data collection. We empirically verified the effectiveness of the proposed mixed policy using both synthetic and real-world data. Our main contributions are: (1) Demonstrating a mixed policy combining deterministic and probabilistic policies, flexibly adjusting the data collection vs. revenue trade-off. (2) Formulating the optimal mixture ratio problem as multi-objective optimization, enabling quantitative evaluation of this trade-off. By optimizing the mixture ratio, businesses can maximize revenue while ensuring reliable future OPE and policy improvement. This framework is applicable in any context where the exploration-exploitation trade-off is relevant.

artificial intelligence, data collection policy, machine learning, (18 more...)

2407.11039

Country: Asia > Japan (0.29)

Genre: Research Report (1.00)

Industry: Energy > Oil & Gas > Upstream (0.49)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceFeb-5-2024

Model-Free, Regret-Optimal Best Policy Identification in Online CMDPs

Zhou, Zihan, Wei, Honghao, Ying, Lei

This paper considers the best policy identification (BPI) problem in online Constrained Markov Decision Processes (CMDPs). We are interested in algorithms that are model-free, have low regret, and identify an approximately optimal policy with a high probability. Existing model-free algorithms for online CMDPs with sublinear regret and constraint violation do not provide any convergence guarantee to an optimal policy and provide only average performance guarantees when a policy is uniformly sampled at random from all previously used policies. In this paper, we develop a new algorithm, named Pruning-Refinement-Identification (PRI), based on a fundamental structural property of CMDPs proved before, which we call limited stochasticity. The property says for a CMDP with $N$ constraints, there exists an optimal policy with at most $N$ stochastic decisions. The proposed algorithm first identifies at which step and in which state a stochastic decision has to be taken and then fine-tunes the distributions of these stochastic decisions. PRI achieves trio objectives: (i) PRI is a model-free algorithm; and (ii) it outputs an approximately optimal policy with a high probability at the end of learning; and (iii) PRI guarantees $\tilde{\mathcal{O}}(H\sqrt{K})$ regret and constraint violation, which significantly improves the best existing regret bound $\tilde{\mathcal{O}}(H^4 \sqrt{SA}K^{\frac{4}{5}})$ under a model-free algorithm, where $H$ is the length of each episode, $S$ is the number of states, $A$ is the number of actions, and the total number of episodes during learning is $2K+\tilde{\cal O}(K^{0.25}).$

algorithm, constraint violation, optimal policy, (15 more...)

2309.15395

Country:

North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)
North America > United States > Washington (0.04)
Europe > Netherlands > South Holland > Leiden (0.04)

Genre: Research Report (0.82)

Industry: Health & Medicine (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.34)

arXiv.org Artificial IntelligenceDec-15-2023

Communication-Efficient Soft Actor-Critic Policy Collaboration via Regulated Segment Mixture in Internet of Vehicles

Yu, Xiaoxue, Li, Rongpeng, Liang, Chengchao, Zhao, Zhifeng

Multi-Agent Reinforcement Learning (MARL) has emerged as a foundational approach for addressing diverse, intelligent control tasks, notably in autonomous driving within the Internet of Vehicles (IoV) domain. However, the widely assumed existence of a central node for centralized, federated learning-assisted MARL might be impractical in highly dynamic environments. This can lead to excessive communication overhead, potentially overwhelming the IoV system. To address these challenges, we design a novel communication-efficient and policy collaboration algorithm for MARL under the frameworks of Soft Actor-Critic (SAC) and Decentralized Federated Learning (DFL), named RSM-MASAC, within a fully distributed architecture. In particular, RSM-MASAC enhances multi-agent collaboration and prioritizes higher communication efficiency in dynamic IoV system by incorporating the concept of segmented aggregation in DFL and augmenting multiple model replicas from received neighboring policy segments, which are subsequently employed as reconstructed referential policies for mixing. Distinctively diverging from traditional RL approaches, with derived new bounds under Maximum Entropy Reinforcement Learning (MERL), RSM-MASAC adopts a theory-guided mixture metric to regulate the selection of contributive referential policies to guarantee the soft policy improvement during communication phase. Finally, the extensive simulations in mixed-autonomy traffic control scenarios verify the effectiveness and superiority of our algorithm.

agent, communication, policy improvement, (15 more...)

2312.10123

Country:

Oceania > Australia > New South Wales > Sydney (0.14)
Europe > Switzerland > Zürich > Zürich (0.14)
North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
(19 more...)

Genre: Research Report (0.50)

Industry: Transportation > Ground > Road (0.66)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

arXiv.org Artificial IntelligenceJul-17-2023

A mixed policy to improve performance of language models on math problems

Chen, Gang

When to solve math problems, most language models take a sampling strategy to predict next word according conditional probabilities. In the math reasoning step, it may generate wrong answer. Considering math problems are deterministic, we propose a mixed policy exploration approach to solve math problems with reinforcement learning. In peculiar, we propose a two level token exploration policy: the abstract level explores next token with probability and the second level is deterministic. Specifically, the abstract level policy will decide whether the token is operator or operand with probability sampling, while the second level is deterministic to select next token with the highest score in a greedy way. We test our method on GSM8K dataset with GPT-2 model, and demonstrate more than $2\%$ performance gain. Our implementation is available at https://github.com/vividitytech/math_lm_rl.

large language model, machine learning, natural language, (17 more...)

2307.08767

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)

Fox, James, MacDermott, Matt, Hammond, Lewis, Harrenstein, Paul, Abate, Alessandro, Wooldridge, Michael

On Imperfect Recall in Multi-Agent Influence Diagrams

arXiv.org Artificial IntelligenceJul-11-2023

Multi-agent influence diagrams (MAIDs) are a popular game-theoretic model based on Bayesian networks. In some settings, MAIDs offer significant advantages over extensive-form game representations. Previous work on MAIDs has assumed that agents employ behavioural policies, which set independent conditional probability distributions over actions for each of their decisions. In settings with imperfect recall, however, a Nash equilibrium in behavioural policies may not exist. We overcome this by showing how to solve MAIDs with forgetful and absent-minded agents using mixed policies and two types of correlated equilibrium. We also analyse the computational complexity of key decision problems in MAIDs, and explore tractable cases. Finally, we describe applications of MAIDs to Markov games and team situations, where imperfect recall is often unavoidable.

artificial intelligence, machine learning, maid, (17 more...)

doi: 10.4204/EPTCS.379.17

2307.05059

Country:

Europe > United Kingdom > England > Oxfordshire > Oxford (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (0.64)

Industry: Leisure & Entertainment > Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)

arXiv.org Machine LearningNov-20-2019

Apprenticeship Learning via Frank-Wolfe

Zahavy, Tom, Cohen, Alon, Kaplan, Haim, Mansour, Yishay

T om Zahavy, Alon Cohen, Haim Kaplan and Yishay Mansour Google Research, Tel Aviv Abstract We consider the applications of the Frank-Wolfe (FW) algorithm for Apprenticeship Learning (AL). In this setting, we are given a Markov Decision Process (MDP) without an explicit reward function. Instead, we observe an expert that acts according to some policy, and the goal is to find a policy whose feature expectations are closest to those of the expert policy. We formulate this problem as finding the projection of the feature expectations of the expert on the feature expectations polytope - the convex hull of the feature expectations of all the deterministic policies in the MDP . We show that this formulation is equivalent to the AL objective and that solving this problem using the FW algorithm is equivalent well-known Projection method of Abbeel and Ng (2004). This insight allows us to analyze AL with tools from convex optimization literature and derive tighter convergence bounds on AL. Specifically, we show that a variation of the FW method that is based on taking "away steps" achieves a linear rate of convergence when applied to AL and that a stochastic version of the FW algorithm can be used to avoid precise estimation of feature expectations. We also experimentally show that this version outperforms the FW baseline. To the best of our knowledge, this is the first work that shows linear convergence rates for AL. 1 Introduction We consider sequential decision making in the Markov decision process (MDP) formalism. Given an MDP, the optimal policy and its value function are characterized by the Bellman equations and can be computed via value or policy iteration. This makes the MDP model useful in problems where we can specify the MDP model (states, actions, reward, transitions) appropriately. However, in many real-world problems, it is often hard to define a reward function, such that the optimal policy with respect to this reward produces the desired behavior.

algorithm, convergence, feature expectation, (17 more...)

arXiv.org Machine Learning

1911.01679

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.24)
Europe > Russia (0.04)
Asia > Russia (0.04)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)